Banks and credit card companies calculate a credit score to assess a customer's creditworthiness. It lets them decide quickly whether to issue a loan to a customer with a good record. Today, banks and credit card companies use machine learning algorithms to classify the customers in their databases based on their credit history.
There are three credit score labels that banks and credit card companies use for their customers:
Good
Standard
Poor
A person with a good credit score can get loans from almost any bank or financial institution. For the task of credit score classification, we need a labelled dataset of credit scores.
I found a suitable dataset for this task on Kaggle, labelled according to the credit history of credit card customers.
Data source: Kaggle -> https://www.kaggle.com/parisrohan
The project aims to analyze a dataset of credit scores and financial behaviour to gain insight into the factors that influence credit scores and financial health.
The dataset contains information about individuals' financial and credit-related attributes, such as annual income, monthly salary, number of bank accounts, credit card usage, loan details, credit history, and credit scores. The data has been collected through financial institutions and credit agencies, and it provides insights into the financial behavior and creditworthiness of individuals.
The dataset allows for the exploration of relationships between financial attributes and credit scores.
Limitations
I start this credit score classification task by importing the necessary Python libraries and the dataset.
Pandas: An open-source library built on top of Python for data manipulation and analysis, Pandas offers data structures and operations for powerful, flexible, and easy-to-use data analysis and manipulation.
Numpy: NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.
tqdm: The tqdm library creates a progress bar while iterating over an iterable (e.g., in a loop), visually showing how much of the iteration has completed. The tqdm.auto module automatically selects the best implementation for the current environment. The bar appears with a description (e.g., "Processing") and updates in real time, which is useful for monitoring time-consuming operations.
matplotlib.pyplot: matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
plotly.express: Plotly Express is the easy-to-use, high-level interface to Plotly. It operates on a variety of data types and provides functions such as px.bar and px.box that produce easy-to-style figures.
plotly.graph_objects: The plotly.graph_objects module (typically imported as go) contains an automatically generated hierarchy of Python classes that represent nodes in the Plotly figure schema; instances of these classes are called "graph objects".
plotly.io: A low-level interface for displaying, reading, and writing figures, e.g., converting a figure to an HTML string or applying figure templates.
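As a quick illustration of the progress-bar behaviour described above, a minimal tqdm sketch (the loop body is a hypothetical placeholder):

```python
from tqdm.auto import tqdm

# Wrapping any iterable in tqdm() yields a live progress bar;
# desc= sets the label shown next to the bar.
total = 0
for i in tqdm(range(100), desc="Processing"):
    total += i
print(total)  # 4950
```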
#!pip install plotly
# Let's start by loading the data and taking a high-level view of its structure, contents, and statistics.
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"
# Loading the dataset
data = pd.read_csv('train.csv')
We can get a quick sense of the size of our dataset by using the shape attribute. This returns a tuple with the number of rows and columns in the dataset.
# Return the number of rows and columns
data.shape
(100000, 28)
# Displaying the head and tail of the dataframe, which show a few rows from the start and end of the dataframe
data.head(10)
| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5634 | 3392 | 1 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 26.822620 | 265.0 | No | 49.574949 | 21.465380 | High_spent_Small_value_payments | 312.494089 | Good |
| 1 | 5635 | 3392 | 2 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 31.944960 | 266.0 | No | 49.574949 | 21.465380 | Low_spent_Large_value_payments | 284.629162 | Good |
| 2 | 5636 | 3392 | 3 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 28.609352 | 267.0 | No | 49.574949 | 21.465380 | Low_spent_Medium_value_payments | 331.209863 | Good |
| 3 | 5637 | 3392 | 4 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 31.377862 | 268.0 | No | 49.574949 | 21.465380 | Low_spent_Small_value_payments | 223.451310 | Good |
| 4 | 5638 | 3392 | 5 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 24.797347 | 269.0 | No | 49.574949 | 21.465380 | High_spent_Medium_value_payments | 341.489231 | Good |
| 5 | 5639 | 3392 | 6 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 27.262259 | 270.0 | No | 49.574949 | 21.465380 | High_spent_Medium_value_payments | 340.479212 | Good |
| 6 | 5640 | 3392 | 7 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 22.537593 | 271.0 | No | 49.574949 | 21.465380 | Low_spent_Small_value_payments | 244.565317 | Good |
| 7 | 5641 | 3392 | 8 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 23.933795 | 272.0 | No | 49.574949 | 21.465380 | High_spent_Medium_value_payments | 358.124168 | Standard |
| 8 | 5646 | 8625 | 1 | Rick Rothackerj | 28.0 | 4075839.0 | Teacher | 34847.84 | 3037.986667 | 2.0 | ... | Good | 605.03 | 24.464031 | 319.0 | No | 18.816215 | 39.684018 | Low_spent_Small_value_payments | 470.690627 | Standard |
| 9 | 5647 | 8625 | 2 | Rick Rothackerj | 28.0 | 4075839.0 | Teacher | 34847.84 | 3037.986667 | 2.0 | ... | Good | 605.03 | 38.550848 | 320.0 | No | 18.816215 | 39.684018 | High_spent_Large_value_payments | 484.591214 | Good |
10 rows × 28 columns
data.tail(10)
| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99990 | 155616 | 34304 | 7 | Sarah McBridec | 28.0 | 31350942.0 | Architect | 20002.88 | 1929.906667 | 10.0 | ... | Bad | 3571.70 | 25.123535 | 74.0 | Yes | 60.964772 | 34.662906 | Low_spent_Large_value_payments | 228.750392 | Standard |
| 99991 | 155617 | 34304 | 8 | Sarah McBridec | 29.0 | 31350942.0 | Architect | 20002.88 | 1929.906667 | 10.0 | ... | Bad | 3571.70 | 37.140784 | 75.0 | Yes | 60.964772 | 34.662906 | High_spent_Large_value_payments | 337.362988 | Standard |
| 99992 | 155622 | 37932 | 1 | Nicks | 24.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 32.991333 | 375.0 | No | 35.104023 | 24.028477 | Low_spent_Small_value_payments | 189.641080 | Poor |
| 99993 | 155623 | 37932 | 2 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 29.135447 | 376.0 | No | 35.104023 | 24.028477 | Low_spent_Medium_value_payments | 400.104466 | Standard |
| 99994 | 155624 | 37932 | 3 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 39.323569 | 377.0 | No | 35.104023 | 24.028477 | High_spent_Medium_value_payments | 410.256158 | Poor |
| 99995 | 155625 | 37932 | 4 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 34.663572 | 378.0 | No | 35.104023 | 24.028477 | High_spent_Large_value_payments | 479.866228 | Poor |
| 99996 | 155626 | 37932 | 5 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 40.565631 | 379.0 | No | 35.104023 | 24.028477 | High_spent_Medium_value_payments | 496.651610 | Poor |
| 99997 | 155627 | 37932 | 6 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 41.255522 | 380.0 | No | 35.104023 | 24.028477 | High_spent_Large_value_payments | 516.809083 | Poor |
| 99998 | 155628 | 37932 | 7 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 33.638208 | 381.0 | No | 35.104023 | 24.028477 | Low_spent_Large_value_payments | 319.164979 | Standard |
| 99999 | 155629 | 37932 | 8 | Nicks | 25.0 | 78735990.0 | Mechanic | 39628.99 | 3359.415833 | 4.0 | ... | Good | 502.38 | 34.192463 | 382.0 | No | 35.104023 | 24.028477 | High_spent_Medium_value_payments | 393.673696 | Poor |
10 rows × 28 columns
# data.info(): This method returns information about the dataframe, including the index dtype,
# columns, non-null counts, and memory usage.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   ID                        100000 non-null  int64
 1   Customer_ID               100000 non-null  int64
 2   Month                     100000 non-null  int64
 3   Name                      100000 non-null  object
 4   Age                       100000 non-null  float64
 5   SSN                       100000 non-null  float64
 6   Occupation                100000 non-null  object
 7   Annual_Income             100000 non-null  float64
 8   Monthly_Inhand_Salary     100000 non-null  float64
 9   Num_Bank_Accounts         100000 non-null  float64
 10  Num_Credit_Card           100000 non-null  float64
 11  Interest_Rate             100000 non-null  float64
 12  Num_of_Loan               100000 non-null  float64
 13  Type_of_Loan              100000 non-null  object
 14  Delay_from_due_date       100000 non-null  float64
 15  Num_of_Delayed_Payment    100000 non-null  float64
 16  Changed_Credit_Limit      100000 non-null  float64
 17  Num_Credit_Inquiries      100000 non-null  float64
 18  Credit_Mix                100000 non-null  object
 19  Outstanding_Debt          100000 non-null  float64
 20  Credit_Utilization_Ratio  100000 non-null  float64
 21  Credit_History_Age        100000 non-null  float64
 22  Payment_of_Min_Amount     100000 non-null  object
 23  Total_EMI_per_month       100000 non-null  float64
 24  Amount_invested_monthly   100000 non-null  float64
 25  Payment_Behaviour         100000 non-null  object
 26  Monthly_Balance           100000 non-null  float64
 27  Credit_Score              100000 non-null  object
dtypes: float64(18), int64(3), object(7)
memory usage: 21.4+ MB
# Displaying summary statistics: the describe() function computes summary statistics for the
# numeric DataFrame columns, giving the count, mean, std, min, quartiles, and max.
print(data.describe())
ID Customer_ID Month Age \
count 100000.000000 100000.000000 100000.000000 100000.000000
mean 80631.500000 25982.666640 4.500000 33.316340
std 43301.486619 14340.543051 2.291299 10.764812
min 5634.000000 1006.000000 1.000000 14.000000
25% 43132.750000 13664.500000 2.750000 24.000000
50% 80631.500000 25777.000000 4.500000 33.000000
75% 118130.250000 38385.000000 6.250000 42.000000
max 155629.000000 50999.000000 8.000000 56.000000
SSN Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts \
count 1.000000e+05 100000.000000 100000.000000 100000.000000
mean 5.004617e+08 50505.123449 4197.270835 5.368820
std 2.908267e+08 38299.422093 3186.432497 2.593314
min 8.134900e+04 7005.930000 303.645417 0.000000
25% 2.451686e+08 19342.972500 1626.594167 3.000000
50% 5.006886e+08 36999.705000 3095.905000 5.000000
75% 7.560027e+08 71683.470000 5957.715000 7.000000
max 9.999934e+08 179987.280000 15204.633333 11.000000
Num_Credit_Card Interest_Rate ... Delay_from_due_date \
count 100000.000000 100000.00000 ... 100000.00000
mean 5.533570 14.53208 ... 21.08141
std 2.067098 8.74133 ... 14.80456
min 0.000000 1.00000 ... 0.00000
25% 4.000000 7.00000 ... 10.00000
50% 5.000000 13.00000 ... 18.00000
75% 7.000000 20.00000 ... 28.00000
max 11.000000 34.00000 ... 62.00000
Num_of_Delayed_Payment Changed_Credit_Limit Num_Credit_Inquiries \
count 100000.000000 100000.000000 100000.000000
mean 13.313120 10.470323 5.798250
std 6.237166 6.609481 3.867826
min 0.000000 0.500000 0.000000
25% 9.000000 5.380000 3.000000
50% 14.000000 9.400000 5.000000
75% 18.000000 14.850000 8.000000
max 25.000000 29.980000 17.000000
Outstanding_Debt Credit_Utilization_Ratio Credit_History_Age \
count 100000.000000 100000.000000 100000.000000
mean 1426.220376 32.285173 221.220460
std 1155.129026 5.116875 99.680716
min 0.230000 20.000000 1.000000
25% 566.072500 28.052567 144.000000
50% 1166.155000 32.305784 219.000000
75% 1945.962500 36.496663 302.000000
max 4998.070000 50.000000 404.000000
Total_EMI_per_month Amount_invested_monthly Monthly_Balance
count 100000.000000 100000.000000 100000.000000
mean 107.699208 55.101315 392.697586
std 132.267056 39.006932 201.652719
min 0.000000 0.000000 0.007760
25% 29.268886 27.959111 267.615983
50% 66.462304 45.156550 333.865366
75% 147.392573 71.295797 463.215683
max 1779.103254 434.191089 1183.930696
[8 rows x 21 columns]
data.dtypes returns the data type of each column. A data type object (an instance of the numpy.dtype class) describes how the bytes in an array item's block of memory should be interpreted: integer, float, Python object, etc.
data.dtypes
ID                            int64
Customer_ID                   int64
Month                         int64
Name                         object
Age                         float64
SSN                         float64
Occupation                   object
Annual_Income               float64
Monthly_Inhand_Salary       float64
Num_Bank_Accounts           float64
Num_Credit_Card             float64
Interest_Rate               float64
Num_of_Loan                 float64
Type_of_Loan                 object
Delay_from_due_date         float64
Num_of_Delayed_Payment      float64
Changed_Credit_Limit        float64
Num_Credit_Inquiries        float64
Credit_Mix                   object
Outstanding_Debt            float64
Credit_Utilization_Ratio    float64
Credit_History_Age          float64
Payment_of_Min_Amount        object
Total_EMI_per_month         float64
Amount_invested_monthly     float64
Payment_Behaviour            object
Monthly_Balance             float64
Credit_Score                 object
dtype: object
# To count the number of null values in a Pandas DataFrame, we can use the isnull() method to create a Boolean mask
# and then use the sum() method to count the number of True values.
print(data.isnull().sum())
ID                          0
Customer_ID                 0
Month                       0
Name                        0
Age                         0
SSN                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Type_of_Loan                0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Num_Credit_Inquiries        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Payment_Behaviour           0
Monthly_Balance             0
Credit_Score                0
dtype: int64
The value_counts() function returns a Series containing counts of unique values, in descending order, so its first element is the most frequent value. By default, it excludes NA values.
data["Credit_Score"].value_counts()
Standard    53174
Poor        28998
Good        17828
Name: Credit_Score, dtype: int64
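The counts above show a clear class imbalance; proportions are often easier to read. As a minimal sketch on a toy series (hypothetical counts, not the real data), `value_counts(normalize=True)` returns relative frequencies:

```python
import pandas as pd

# Toy series mirroring the imbalance of the Credit_Score column
s = pd.Series(["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2)
proportions = s.value_counts(normalize=True)
print(proportions)  # Standard 0.5, Poor 0.3, Good 0.2
```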
# Plotting distributions of a few key columns
# Selecting a few columns for distribution plots
columns_to_plot = ['Age', 'Annual_Income', 'Credit_Score']
# Plotting distributions
for column in columns_to_plot:
plt.figure(figsize=(10, 5))
sns.histplot(data[column].dropna(), kde=True)
plt.title('Distribution of ' + column)
plt.xlabel(column)
plt.ylabel('Frequency')
plt.show()
# I will start with a heatmap to understand the correlations between different features.
# Then, I will create box plots to detect outliers in key features.
# Calculating correlations between the numeric columns
correlation_matrix = data.corr(numeric_only=True)
# Plotting the correlation matrix to identify key factors impacting credit scores
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
The correlation matrix provides insights into the relationships between different variables and credit scores. This visualization helps identify the key factors that impact credit scores.
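To read the matrix programmatically rather than visually, the feature pairs can be ranked by absolute correlation. A minimal sketch on a toy frame with hypothetical values (not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy numeric frame standing in for the real dataset (hypothetical values)
df = pd.DataFrame({
    "Annual_Income": [19114.0, 34847.0, 20002.0, 39628.0, 55000.0],
    "Monthly_Inhand_Salary": [1824.0, 3037.0, 1929.0, 3359.0, 4600.0],
    "Outstanding_Debt": [809.0, 605.0, 3571.0, 502.0, 300.0],
})
corr = df.corr(numeric_only=True)
# Keep the upper triangle only (each pair once), then rank by absolute strength
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().abs().sort_values(ascending=False)
print(pairs)
```

On this toy data, annual income and monthly in-hand salary come out as the most strongly related pair, as one would expect.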
The dataset has many features that can train a Machine Learning model for credit score classification. Let’s explore all the features one by one.
I will start by exploring the occupation feature to know if the occupation of the person affects credit scores:
fig = px.box(data,
x="Occupation",
color="Credit_Score",
title="Credit Scores Based on Occupation",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.show()
There’s not much difference in credit scores across the occupations in the data. Next, I will explore whether a person's annual income impacts their credit score:
fig = px.box(data,
x="Credit_Score",
y="Annual_Income",
color="Credit_Score",
title="Credit Scores Based on Annual Income",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
According to the above visualization, the more you earn annually, the better your credit score. Next, I explore whether the monthly in-hand salary impacts credit scores:
fig = px.box(data,
x="Credit_Score",
y="Monthly_Inhand_Salary",
color="Credit_Score",
title="Credit Scores Based on Monthly Inhand Salary",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
Like annual income, a higher monthly in-hand salary is associated with a better credit score. Now, I check whether having more bank accounts impacts credit scores:
fig = px.box(data,
x="Credit_Score",
y="Num_Bank_Accounts",
color="Credit_Score",
title="Credit Scores Based on Number of Bank Accounts",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
Maintaining more than five accounts is not good for a credit score; a person should have only 2 – 3 bank accounts. So having more bank accounts doesn’t positively impact credit scores. Next, I will check the impact of the number of credit cards on credit scores:
fig = px.box(data,
x="Credit_Score",
y="Num_Credit_Card",
color="Credit_Score",
title="Credit Scores Based on Number of Credit cards",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
Just like the number of bank accounts, having more credit cards does not positively impact credit scores; having 3 – 5 credit cards appears good for the score. Next, I check the impact of the average interest rate paid on loans and EMIs:
fig = px.box(data,
x="Credit_Score",
y="Interest_Rate",
color="Credit_Score",
title="Credit Scores Based on the Average Interest rates",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
If the average interest rate is 4 – 11%, the credit score is good; an average interest rate above 15% is associated with poor credit scores. Now let’s see how many loans one can take at a time while keeping a good credit score:
fig = px.box(data,
x="Credit_Score",
y="Num_of_Loan",
color="Credit_Score",
title="Credit Scores Based on Number of Loans Taken by the Person",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
To have a good credit score, one should have at most 1 – 3 loans at a time; having more than three loans at a time negatively impacts credit scores.
# Checking whether monthly investments affect credit scores
fig = px.box(data,
x="Credit_Score",
y="Amount_invested_monthly",
color="Credit_Score",
title="Credit Scores Based on Amount Invested Monthly",
color_discrete_map={'Poor':'red',
'Standard':'yellow',
'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()
The amount of money you invest monthly doesn’t affect your credit scores a lot.
# Feature Analysis and Engineering
# Let's start by checking for missing values in the dataset.
missing_values = data.isnull().sum()
print('Missing values in each column:\n', missing_values)
Missing values in each column:
ID                          0
Customer_ID                 0
Month                       0
Name                        0
Age                         0
SSN                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Type_of_Loan                0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Num_Credit_Inquiries        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Payment_Behaviour           0
Monthly_Balance             0
Credit_Score                0
Debt_Income_Ratio           0
dtype: int64
# We will also check the unique values for categorical columns to decide on encoding strategies.
categorical_columns = data.select_dtypes(include=['object']).columns
for col in categorical_columns:
print(f'Unique values in {col}:', data[col].nunique())
Unique values in Name: 10128
Unique values in Occupation: 15
Unique values in Type_of_Loan: 6261
Unique values in Credit_Mix: 3
Unique values in Payment_of_Min_Amount: 3
Unique values in Payment_Behaviour: 6
Unique values in Credit_Score: 3
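The cardinality counts above inform the encoding strategy: low-cardinality columns such as Credit_Mix (3 values) suit one-hot encoding, while very high-cardinality columns such as Type_of_Loan (6261 values) need a different approach. A minimal one-hot sketch on toy data:

```python
import pandas as pd

# Toy column standing in for the low-cardinality Credit_Mix feature
df = pd.DataFrame({"Credit_Mix": ["Good", "Bad", "Standard", "Good"]})
# get_dummies replaces the column with one indicator column per category
encoded = pd.get_dummies(df, columns=["Credit_Mix"])
print(encoded.columns.tolist())
# ['Credit_Mix_Bad', 'Credit_Mix_Good', 'Credit_Mix_Standard']
```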
# As an example of feature engineering, I am creating a new feature
# that represents the ratio of outstanding debt to annual income.
data['Debt_Income_Ratio'] = data['Outstanding_Debt'] / data['Annual_Income']
# Displaying the head of the dataframe to show the new feature.
print(data[['Outstanding_Debt', 'Annual_Income', 'Debt_Income_Ratio']].head())
   Outstanding_Debt  Annual_Income  Debt_Income_Ratio
0            809.98       19114.12           0.042376
1            809.98       19114.12           0.042376
2            809.98       19114.12           0.042376
3            809.98       19114.12           0.042376
4            809.98       19114.12           0.042376
# Handling missing values before recomputing the new feature.
# As an example, fill any missing values in 'Outstanding_Debt' and
# 'Annual_Income' with the median of each column.
data['Outstanding_Debt'] = data['Outstanding_Debt'].fillna(data['Outstanding_Debt'].median())
data['Annual_Income'] = data['Annual_Income'].fillna(data['Annual_Income'].median())
# Now let's recreate the 'Debt_Income_Ratio' feature
data['Debt_Income_Ratio'] = data['Outstanding_Debt'] / data['Annual_Income']
# Display the head of the dataframe to show the new feature along with 'Outstanding_Debt' and 'Annual_Income'.
print(data[['Outstanding_Debt', 'Annual_Income', 'Debt_Income_Ratio']].head())
   Outstanding_Debt  Annual_Income  Debt_Income_Ratio
0            809.98       19114.12           0.042376
1            809.98       19114.12           0.042376
2            809.98       19114.12           0.042376
3            809.98       19114.12           0.042376
4            809.98       19114.12           0.042376
The 'Debt_Income_Ratio' feature has been successfully created and added to the dataset. This feature represents the ratio of outstanding debt to annual income, which could be a useful indicator for financial health and creditworthiness.
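One caveat: if Annual_Income can be zero or missing, the ratio produces infinities or NaNs. A small sketch of a guard, on hypothetical rows rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: valid income, zero income, missing income
df = pd.DataFrame({
    "Outstanding_Debt": [809.98, 605.03, 500.00],
    "Annual_Income": [19114.12, 0.0, np.nan],
})
# Replacing zero income with NaN makes the division yield NaN instead of inf
df["Debt_Income_Ratio"] = (df["Outstanding_Debt"]
                           / df["Annual_Income"].replace(0, np.nan))
print(df["Debt_Income_Ratio"].round(4).tolist())  # [0.0424, nan, nan]
```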
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# Encode the target variable
label_encoder = LabelEncoder()
data['Credit_Score_encoded'] = label_encoder.fit_transform(data['Credit_Score'])
# Select only numeric columns for features
numeric_columns = data.select_dtypes(include=['number']).columns
X = data[numeric_columns].drop(columns=['Credit_Score_encoded'])
y = data['Credit_Score_encoded']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Output the non-numeric columns and the accuracy
print('Non-numeric columns:', data.select_dtypes(exclude=['number']).columns.tolist())
print('Accuracy of the RandomForestClassifier:', accuracy)
Non-numeric columns: ['Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour', 'Credit_Score']
Accuracy of the RandomForestClassifier: 0.83815
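For reference, LabelEncoder assigns integer codes in alphabetical order of the class labels, which determines how the encoded target above maps back to Good/Poor/Standard. A minimal illustration:

```python
from sklearn.preprocessing import LabelEncoder

# fit_transform sorts the unique labels alphabetically before coding them
le = LabelEncoder()
codes = le.fit_transform(["Standard", "Poor", "Good", "Standard"])
print(list(le.classes_))  # ['Good', 'Poor', 'Standard']
print(codes.tolist())     # [2, 1, 0, 2]
```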
I then make predictions on the test data and calculate the model's accuracy from the predicted and actual values.
The non-numeric columns identified in the dataset are 'Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour', and 'Credit_Score'. These were excluded from the feature set used to train the RandomForestClassifier.
After training the model on the remaining numeric features and making predictions on the test set, the accuracy of the RandomForestClassifier is approximately 83.8%. This is a realistic accuracy score on held-out data, suggesting that the model is providing a credible evaluation of its predictive performance.
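Accuracy alone can hide per-class weaknesses, especially given the class imbalance seen earlier (Standard is far more common than Good). A sketch of per-class precision and recall, run on synthetic stand-in data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the credit data
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
# classification_report shows precision, recall, and F1 for each class
report = classification_report(y_test, clf.predict(X_test))
print(report)
```

On the real data, a low recall for the Poor class would matter more than a small drop in overall accuracy.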
The conclusion is that I have successfully trained a Random Forest classifier to predict credit scores based on the selected numeric features. The model's accuracy has been evaluated, and it can be used to make predictions on new data.
Overall, the dataset has been processed, the model has been trained, and its performance has been assessed. This concludes the project, and the trained model can now be used for credit score prediction.